White Wine Dataset Report by Anthony Munnelly

Univariate Plots Section

I will plot all the variables to see what their distributions are like. Density plots are better suited than histograms to continuous variables such as these.
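As a minimal sketch of one such density plot, the code below applies base R's density() to the first ten alcohol values listed in the structure output further down. The code that produced the actual plots is not shown in this report, so the use of base R is an assumption, and ten values are far too few for a meaningful curve.

```r
# First ten alcohol values as listed in the str() output of this report;
# the full dataset has 4898 observations.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Kernel density estimate using R's default Gaussian kernel and bandwidth.
alcohol_density <- density(alcohol)

# Draw the estimated density curve.
plot(alcohol_density,
     main = "Density of alcohol (sample rows only)",
     xlab = "alcohol")
```

The same call could be repeated for each numeric column to get one curve per variable.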

Univariate Analysis

What is the structure of your dataset?

This is the structure of the data,

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...

and these are summaries of each variable in the data set.

##        X        fixed.acidity    volatile.acidity  citric.acid     residual.sugar     chlorides      
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600   Min.   :0.00900  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700   1st Qu.:0.03600  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200   Median :0.04300  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391   Mean   :0.04577  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900   3rd Qu.:0.05000  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800   Max.   :0.34600  
##                                                                                                      
##  free.sulfur.dioxide total.sulfur.dioxide    density             pH          sulphates         alcohol      quality 
##  Min.   :  2.00      Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00   3:  20  
##  1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   4: 163  
##  Median : 34.00      Median :134.0        Median :0.9937   Median :3.180   Median :0.4700   Median :10.40   5:1457  
##  Mean   : 35.31      Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51   6:2198  
##  3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   7: 880  
##  Max.   :289.00      Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20   8: 175  
##                                                                                                             9:   5

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the white wine dataset are quality and alcohol. These are the two most likely reasons why people buy and drink wine in the first place.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Theory suggests that sulphates and residual sugar have strong influences on the quality and alcohol content of a wine. I hope to investigate this theory.

Did you create any new variables from existing variables in the dataset?

I converted quality from an integer to a factor. This is a table of the resulting counts at each level.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
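The conversion can be sketched as follows. Because the raw file is not included here, the integer vector is rebuilt from the counts in the table above, and the variable names are illustrative only.

```r
# Rebuild an integer quality vector from the counts reported above
# (a stand-in for the real column, which is not available here).
counts <- c(20, 163, 1457, 2198, 880, 175, 5)
quality_int <- rep(3:9, times = counts)

# Convert the integer scores to a factor with one level per score.
quality <- factor(quality_int)

# Tabulate the number of wines at each quality level.
quality_table <- table(quality)
print(quality_table)
```

Treating quality as a factor lets plotting and grouping functions handle each score as a separate category rather than as a point on a continuous scale.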

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution for wines of Quality 9 looked unusual, but this was a distortion caused by there being only five such wines in the sample. pH is nearly normally distributed, with a median of 3.18 and a mean of 3.188. The rest of the variables are heavily right-skewed, with mean and median values that sit low relative to their maxima (residual sugar, for example, has a median of 5.2 and a maximum of 65.8).
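One rough way to check these skew claims is to compare each mean with its median, since a mean well above the median points to a long right tail. The sketch below uses values copied from the summary output above. The relative-gap measure is an illustrative heuristic, not a formal skewness statistic.

```r
# Medians and means copied from the summary() output in this report.
stats <- data.frame(
  variable = c("pH", "residual.sugar", "chlorides", "alcohol"),
  median   = c(3.18, 5.2, 0.043, 10.40),
  mean     = c(3.188, 6.391, 0.04577, 10.51)
)

# Gap between mean and median, relative to the median.
# A value near zero is consistent with a symmetric distribution;
# a clearly positive value hints at right skew.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Show the variables from smallest to largest gap.
print(stats[order(stats$rel_gap), ])
```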

Bivariate Plots Section

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The first thing I noticed was how the small sample size (five) for wines of Quality 9 distorted the results. These five wines were therefore removed before generating the graphs in this part of the report.
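Dropping the five Quality 9 wines could look like the sketch below. The data frame is a stand-in built from the quality counts reported earlier, and the names wine and wine_no9 are hypothetical.

```r
# Stand-in data frame with only the quality column, rebuilt from the
# counts in this report (the real data has 13 columns).
wine <- data.frame(
  quality = factor(rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5)))
)

# Keep every wine except those rated 9, then drop the now-empty level
# so it does not appear as a blank group in later plots.
wine_no9 <- droplevels(subset(wine, quality != "9"))

nrow(wine_no9)             # 4893 rows remain (4898 - 5)
levels(wine_no9$quality)   # "3" "4" "5" "6" "7" "8"
```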

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Alcohol content varies considerably from one quality level to the next, suggesting that the quality of a wine may depend on its alcohol content.

What was the strongest relationship you found?

The strongest relationship I found was that between density and alcohol, which have a Pearson correlation coefficient of -0.7801376.
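For reference, a coefficient of this kind can be obtained with R's cor() function, which uses the Pearson method by default. The sketch below applies it to the five density and alcohol values visible in the structure output, so its result will not match the full-data figure. The manual calculation is included only to show what cor() computes.

```r
# First five density and alcohol values shown in the str() output.
density_sample <- c(1.001, 0.994, 0.995, 0.996, 0.996)
alcohol_sample <- c(8.8, 9.5, 10.1, 9.9, 9.9)

# Pearson correlation via the built-in function
# (method = "pearson" is the default).
r <- cor(density_sample, alcohol_sample)

# The same quantity by hand: covariance divided by the product of the
# two standard deviations.
r_manual <- cov(density_sample, alcohol_sample) /
  (sd(density_sample) * sd(alcohol_sample))

print(c(cor = r, manual = r_manual))
```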

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Alcohol is a feature of quality wine. Alcohol content tends to increase with quality, but not in a way that follows any clear mathematical relationship.

Acidity or basicity makes no difference to alcohol content, irrespective of quality. It is hard to make a mathematical case for any strong relationship in the dataset, other than the inverse correlation between density and alcohol.

Were there any interesting or surprising interactions between features?

The relationship between residual sugar and alcohol is easier to see when residual sugar is plotted on the y-axis and alcohol on the x-axis. This is counter-intuitive, since in wine-making sugar is the independent variable and alcohol the dependent one.

Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I did not create any model with the dataset. The only feature of the dataset that could be modeled is the inverse correlation between density and alcohol. That relationship is only of interest to chemists, and chemists are probably already aware of it. No other correlation between variables is strong enough to build a model on.


Final Plots and Summary

Plot One: Density v Alcohol

Description

This graph plots density against alcohol for the sample data. The points on the scatter plot are set at alpha = 0.1 to reduce over-plotting. A trendline is added with the stat_smooth() function to show the strong negative correlation between density and alcohol.
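A minimal sketch of a plot like this, assuming the ggplot2 package and a linear fit for the trendline (the report names stat_smooth() but not the method). The data frame holds only the five density and alcohol values visible in the structure output, and the object names are hypothetical.

```r
library(ggplot2)

# Five sample rows taken from the str() output; the real plot uses
# the full dataset.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  density = c(1.001, 0.994, 0.995, 0.996, 0.996)
)

plot_one <- ggplot(wine, aes(x = alcohol, y = density)) +
  # Transparent points reduce over-plotting (0.1 assumed to be alpha).
  geom_point(alpha = 0.1) +
  # Assumption: a straight-line trend without a confidence band.
  stat_smooth(method = "lm", se = FALSE) +
  labs(title = "Density v Alcohol", x = "alcohol", y = "density")

print(plot_one)
```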

Plot Two: Residual Sugar v Alcohol

Description

Residual sugar is plotted against alcohol in a scatterplot graph. The points are colored according to their quality to add further information to the graph, and are set at alpha = 0.5 to reduce over-plotting.
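A plot along these lines could be sketched as below, assuming ggplot2. The residual sugar and alcohol values are the first ten rows from the structure output. Those rows are all Quality 6 in the source, so the quality values here are made up purely to show the color mapping.

```r
library(ggplot2)

# alcohol and residual.sugar: first ten rows from the str() output.
# quality: hypothetical values, invented only to illustrate coloring.
wine <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  quality        = factor(c(5, 6, 6, 7, 6, 5, 6, 4, 8, 7))
)

plot_two <- ggplot(wine, aes(x = alcohol, y = residual.sugar,
                             color = quality)) +
  # Semi-transparent points, one color per quality level.
  geom_point(alpha = 0.5) +
  labs(title = "Residual Sugar v Alcohol",
       x = "alcohol", y = "residual sugar")

print(plot_two)
```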

Plot Three: Alcohol v Quality

Description

This is a graph of the alcohol quantity in wine broken down by the quality of the wines in the sample set. The faceting of the graph makes it easy to compare one wine’s alcohol quantity to another’s. Strong colors are used in the graph to make the data jump out.
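The faceting idea can be sketched as follows, assuming ggplot2. The report does not say which geometry was used, so the histogram here is an assumption, and the quality values are hypothetical (the alcohol values are the first ten rows from the structure output).

```r
library(ggplot2)

# alcohol: first ten rows from the str() output.
# quality: hypothetical values, invented so that several facets appear.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality = factor(c(5, 6, 6, 7, 6, 5, 6, 5, 7, 7))
)

plot_three <- ggplot(wine, aes(x = alcohol, fill = quality)) +
  # Assumed geometry: a histogram of alcohol content.
  geom_histogram(binwidth = 0.5) +
  # One panel per quality level, sharing the same axes for comparison.
  facet_wrap(~ quality) +
  labs(title = "Alcohol v Quality", x = "alcohol", y = "count")

print(plot_three)
```

Because facet_wrap() keeps the axes identical across panels by default, the panels can be compared directly by eye.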


Reflection

The only strong correlation in this dataset is between density and alcohol, and that relationship is not a factor when people are buying wine. While this is a disappointment to statisticians, it is almost certainly good news for sommeliers, who can be reassured that their specialty is indeed more art than science.

This was not an easy dataset for someone who is neither a chemist nor an oenophile to work with. I studied chemistry in high school, but figuring out the relationships between the variables was the child of research and guesswork.

The absence of units of measurement is a drawback to the dataset – what are these variables measured in? Sulfur is measured in parts per million – does this apply to chlorides, sugars and acids too? What is the basis of the quality scale?

Quality was also a difficult factor to deal with, as the sample is so dominated by wines of Quality 6 (2198 of the 4898 wines).